Reading Roy & Pentland (2002), "Learning words from sights and sounds: a computational model"

Greg Detre

Tuesday, October 01, 2002

"Learning words from sights and sounds: a computational model", Deb K. Roy and Alex P. Pentland, Cognitive Science, Volume 26, Issue 1, January-February 2002, Pages 113-146

 

Abstract

This paper presents an implemented computational model of word acquisition which learns directly from raw multimodal sensory input. Set in an information theoretic framework, the model acquires a lexicon by finding and statistically modeling consistent cross-modal structure. The model has been implemented in a system using novel speech processing, computer vision, and machine learning algorithms. In evaluations the model successfully performed speech segmentation, word discovery and visual categorization from spontaneous infant-directed speech paired with video images of single objects. These results demonstrate the possibility of using state-of-the-art techniques from sensory pattern recognition and machine learning to implement cognitive models which can process raw sensor data without the need for human transcription or labeling.

Notes

1. Introduction

CELL stands for Cross-channel Early Lexical Learning

Set in an information theoretic framework, the model acquires a lexicon by finding and statistically modeling consistent intermodal structure

CELL has been implemented for the task of learning shape names from a database of infant-directed speech recordings which were paired with images of objects

2. Problems of early lexical acquisition

CELL addresses three inter-related questions of early lexical acquisition. First, how do infants discover speech segments which correspond to the words of their language? Second, how do they learn perceptually grounded semantic categories? And tying these questions together: How do infants learn to associate linguistic units with appropriate semantic categories?

Within CELL, these three problems are treated as different facets of one underlying problem: to discover structure across spoken and contextual input.

3. Background

The CELL model addresses problems of word discovery from fluent speech and word-to-meaning acquisition within a single framework. Although computational modeling efforts have not explored these problems jointly, there are several models which treat the problems separately.

Our goal of word discovery refers to the problem of discovering some words of the underlying language from unsegmented input. Our task is thus a subtask of complete segmentation. Models of complete segmentation are nonetheless interesting as indicators of the extent to which segmentation may be performed by analysis of speech input alone.

Speech segmentation models may be divided into two classes

One class attempts to detect word boundaries based on local sound sequence patterns or statistics

This hypothesis is supported by infant research which has shown that 8-month-old infants are sensitive to transition probabilities of syllables, suggesting that they may use these cues to aid in segmentation (Saffran, Aslin & Newport, 1996).

A second class of segmentation algorithms explicitly models the words of the language … ???

CELL is concerned with learning words whose referents may be learned from direct physical observations. The current instantiation of the model, however, does not address learning words which are abstractly defined or difficult to learn by direct observation. Nonetheless, acquisition of word meaning in this limited sense is not trivial. There are multiple levels of ambiguity when learning word meaning from context. First, words often appear in the absence of their referents, even in infant-directed speech. This introduces ambiguity for learners attempting to link words to referents by observing co-occurrences. Ambiguity may also arise from the fact that a given context may be interpreted in numerous different ways (Quine, 1960). Even if a word is assumed to refer to a specific context, an unlimited number of interpretations of the context may be logically possible. Further ambiguities arise since both words and contexts are observed through perceptual processes that are susceptible to multiple sources of variation and distortion. The remainder of this section discusses approaches to resolving such ambiguities.

The mutual information between the occurrence of a word and a shape or color type computed from multiple observations was used to evaluate the strength of association. CELL employs a cross-situational strategy to resolve word-referent ambiguity using mutual information.
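A rough sketch of how this kind of association strength might be computed (my own reconstruction, not the paper's code): treat word occurrence and shape occurrence as binary events and compute their mutual information from co-occurrence counts. The counts below are hypothetical.

```python
import math

def mutual_information(n_ww, n_wn, n_nw, n_nn):
    """Mutual information (bits) between two binary events: word present/absent
    vs. shape present/absent. n_ww: both present, n_wn: word only,
    n_nw: shape only, n_nn: neither."""
    n = n_ww + n_wn + n_nw + n_nn
    joint = [[n_ww / n, n_wn / n],    # rows: word present / absent
             [n_nw / n, n_nn / n]]    # cols: shape present / absent
    p_word = [joint[0][0] + joint[0][1], joint[1][0] + joint[1][1]]
    p_shape = [joint[0][0] + joint[1][0], joint[0][1] + joint[1][1]]
    mi = 0.0
    for i in range(2):
        for j in range(2):
            if joint[i][j] > 0:
                mi += joint[i][j] * math.log2(joint[i][j] / (p_word[i] * p_shape[j]))
    return mi

# Hypothetical counts over 100 observations:
print(mutual_information(20, 5, 3, 72))   # "dog" mostly co-occurs with the dog shape: high MI
print(mutual_information(6, 40, 10, 44))  # "the" occurs regardless of shape: low MI
```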

By not representing texture, weight and countless other potential attributes (and combinations of attributes) of an object, implicit constraints were placed on what was learnable.

Regier (1996) developed a model for learning spatial words ("above", "below", "through", etc.) by presenting a neural network with line drawing animations paired with word labels

For the experiments reported in this paper, CELL does not address Quine's dilemma since only one type of contextual attribute, object shape, is represented.

A variety of statistical pattern recognition techniques exist for representing and classifying noisy signals. In CELL, both neural networks and density estimation are employed to deal with variations in acoustic and visual signals.

The problem of segmenting fluent speech to discover words is addressed jointly with the problem of associating words with co-occurring referents. CELL is compatible with models which treat each problem separately, but brings to light the advantage of leveraging partial evidence from each task within a joint framework.

4. The CELL model

A schematic of the CELL model is presented in Fig. 1. The model discovers words by searching for segments of speech which reliably predict the presence of visually co-occurring shapes. Input to the system consists of spoken utterances paired with images of objects. This multimodal input approximates the stimuli that infants may receive when listening to caregivers while visually attending to objects in the environment. A short-term memory (STM) buffers recent utterance-shape pairs for a brief period of time. Each time a new observation is entered into the STM, a short-term recurrence filter generates hypotheses of word-shape associations by segmenting and extracting speech subsequences from utterances in the STM and pairing them with co-occurring shape observations. These hypotheses are placed in a long-term memory (LTM). A second filter operates on the LTM to consolidate reliable hypotheses over multiple observations. A garbage collection process eliminates unreliable hypotheses from the LTM. Output of the model is a set of {speech segment, shape} associations.

Learning is an on-line process in CELL. Unprocessed input is stored temporarily in STM. Repeated speech and visual patterns are extracted by the recurrence filter, and the remaining information is permanently discarded. As a result, CELL has only limited reliance on verbatim memory of sensory input.(???)
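As a reminder to myself, a skeletal sketch of this on-line loop (my own reading of Fig. 1, not the authors' code). All of the helper callables (segment_utterance, segments_match, shapes_match, consolidate_ltm) are placeholders for the speech, vision and mutual-information machinery described below; the STM capacity is an assumed number of utterance-shape pairs.

```python
from collections import deque

STM_CAPACITY = 5   # assumed: enough recent AV-events to cover roughly 10 s of speech

def cell_learning_loop(av_events, segment_utterance, segments_match,
                       shapes_match, consolidate_ltm):
    """Skeleton of CELL's on-line learning: an STM recurrence filter feeding
    AV-prototypes into LTM, followed by consolidation / garbage collection."""
    stm = deque(maxlen=STM_CAPACITY)   # short-term memory of {utterance, shape} pairs
    ltm = []                           # long-term memory of AV-prototypes / lexical items
    for utterance, shape in av_events:
        # 1. Recurrence filter: compare segments of the new utterance against
        #    segments of earlier STM entries with a matching visual context.
        for seg in segment_utterance(utterance):
            for old_utterance, old_shape in stm:
                if not shapes_match(shape, old_shape):
                    continue
                if any(segments_match(seg, old_seg)
                       for old_seg in segment_utterance(old_utterance)):
                    ltm.append((seg, shape))   # hypothesised AV-prototype
        stm.append((utterance, shape))         # oldest entry falls out of STM
        # 2. Mutual-information filter and garbage collection over LTM.
        ltm = consolidate_ltm(ltm)
    return ltm
```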

The authors made the following simplifying assumptions:

The learner possesses a short term memory (STM) which can store approximately 10 s of speech represented in terms of phoneme probabilities. The STM also stores representations derived from co-occurring visual input.

The visual context of an input utterance is a single object

Only the shape of objects is represented. Other attributes such as color, size, texture, and so forth are not represented in the current implementation.

There are built-in mechanisms for generating and comparing phonemic representations derived from the acoustic waveform. In this model, coarse level phonemic representation occurs prior to word learning. Similarly, the model also has built-in mechanisms for extracting and comparing shape representations.

4.1. Representing and comparing spoken utterances

Spoken utterances are represented as arrays of phoneme probabilities (Fig. 2). Acoustic input is first converted into a spectral representation using the Relative Spectral-Perceptual Linear Prediction (RASTA-PLP) algorithm (Hermansky & Morgan, 1994). RASTA-PLP is designed to attenuate nonspeech components of an acoustic signal. It does so by suppressing spectral components of the signal which change either faster or slower than speech.

A recurrent neural network analyses RASTA-PLP coefficients to estimate phoneme and speech/silence probabilities. The RNN has 12 input units, 176 hidden units, and 40 output units. The 176 hidden units are connected through a time delay and concatenated with the RASTA-PLP input coefficients. Thus, the input layer at time t consists of 12 incoming RASTA-PLP coefficients concatenated with the activation values of the hidden units from time t-1. The time delay units give the network the capacity to remember aspects of old input and combine those representations with fresh data. This capacity for temporal memory has been shown to effectively model coarticulation effects in speech (Robinson, 1994). The RNN was trained off-line using the TIMIT database of phonetically transcribed American English speech recordings (Garofolo, 1988). ???
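A minimal numpy sketch of this recurrent structure (an Elman-style network in which the hidden state is fed back and concatenated with the 12 RASTA-PLP coefficients). The weights here are random placeholders; the real network was trained on TIMIT.

```python
import numpy as np

# Hypothetical weights; the paper's network was trained off-line on TIMIT.
rng = np.random.default_rng(0)
N_IN, N_HID, N_OUT = 12, 176, 40          # unit counts given in the paper
W_h = rng.normal(scale=0.1, size=(N_HID, N_IN + N_HID))
b_h = np.zeros(N_HID)
W_o = rng.normal(scale=0.1, size=(N_OUT, N_HID))
b_o = np.zeros(N_OUT)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rnn_forward(rasta_frames):
    """rasta_frames: (T, 12) RASTA-PLP coefficients. Returns (T, 40) output probabilities."""
    h = np.zeros(N_HID)
    outputs = []
    for x in rasta_frames:
        # input layer at time t = current coefficients + hidden activations from t-1
        h = np.tanh(W_h @ np.concatenate([x, h]) + b_h)
        outputs.append(softmax(W_o @ h + b_o))
    return np.vstack(outputs)

print(rnn_forward(rng.normal(size=(50, 12))).shape)   # (50, 40) for 50 frames of fake input
```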

To locate approximate phoneme boundaries, the RNN outputs are treated as state emission probabilities in a Hidden Markov Model (HMM) framework.

After Viterbi decoding of an utterance, the system obtains (1) a phoneme sequence, the most likely sequence of phonemes which were concatenated to form the utterance and (2) the location of each phoneme boundary for the sequence. Each phoneme boundary serves as a speech segment start or end point. Any subsequence within an utterance terminated at phoneme boundaries is used to form word hypotheses. Each candidate is guaranteed to consist of one or more syllables consisting of a vowel and consonant or consonant cluster on either side of the vowel.
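A toy version of this decoding step, assuming the RNN's per-frame phoneme log-probabilities are used as emission scores. The transition structure here (a single self-loop probability with uniform switching) is my simplification, not the paper's HMM; boundaries are read off wherever the decoded phoneme changes.

```python
import numpy as np

def viterbi_boundaries(log_emissions, self_loop=0.9):
    """log_emissions: (T, N) per-frame phoneme log-probabilities (e.g. log of RNN outputs).
    Returns the best per-frame phoneme sequence and the frame indices where the
    decoded phoneme changes, which serve as candidate segment boundaries."""
    T, N = log_emissions.shape
    log_self = np.log(self_loop)
    log_switch = np.log((1.0 - self_loop) / (N - 1))
    delta = log_emissions[0].copy()              # best path score ending in each phoneme
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        stay = delta + log_self                  # remain in the same phoneme
        best_prev = delta.max()                  # or switch from the best previous phoneme
        back[t] = np.where(stay >= best_prev + log_switch, np.arange(N), delta.argmax())
        delta = np.maximum(stay, best_prev + log_switch) + log_emissions[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):                # trace back the best path
        path[t - 1] = back[t, path[t]]
    boundaries = [t for t in range(1, T) if path[t] != path[t - 1]]
    return path, boundaries

# Fake input: 100 frames over 40 phoneme classes.
fake = np.log(np.random.default_rng(1).dirichlet(np.ones(40), size=100))
_, bounds = viterbi_boundaries(fake)
print(len(bounds), "candidate phoneme boundaries")
```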

A distance metric, dA( ), measures the similarity between two speech segments. It is possible to treat the phoneme sequence of each speech segment as a string and use string comparison techniques. This method has been applied to the problem of finding recurrent speech segments in continuous speech (Wright, Carey & Parris, 1996). A limitation of this method is that it relies on only the single most likely phoneme sequence.

The speech distance metric defined by Eq. (1) measures the similarity of phonetic structure between two speech sounds. The measure is the product of two terms: the probability that the HMM extracted from observation A produced observation B, and vice versa. Empirically, this metric was found to return small values for words which humans would judge as phonetically similar.(???)
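To fix the idea of Eq. (1)'s symmetric structure, here is a much-simplified stand-in (not the paper's metric): each segment is summarised by its average phoneme distribution instead of an HMM, and the distance combines how well each segment's "model" explains the other. Only the symmetric product-of-cross-likelihoods shape is taken from the paper; everything else is my simplification.

```python
import numpy as np

def speech_distance(seg_a, seg_b, eps=1e-12):
    """seg_a, seg_b: (T, N) phoneme probability arrays for two speech segments.
    Simplified d_A( ): negative log of the product of the two cross-likelihoods,
    each normalised by segment length. Smaller values = more similar."""
    model_a = seg_a.mean(axis=0) + eps          # stand-in 'model' of segment A
    model_b = seg_b.mean(axis=0) + eps
    ll_b_given_a = np.log(seg_b @ model_a + eps).mean()   # A's model explaining B
    ll_a_given_b = np.log(seg_a @ model_b + eps).mean()   # B's model explaining A
    return -(ll_b_given_a + ll_a_given_b)

rng = np.random.default_rng(0)
a = rng.dirichlet(np.ones(40), size=30)   # 30 frames of fake phoneme probabilities
b = rng.dirichlet(np.ones(40), size=25)
print(speech_distance(a, b))
```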

4.2. Representing and comparing visual input

Three-dimensional objects are represented using a view-based approach in which two-dimensional images of an object captured from multiple viewpoints collectively form a visual model of the object. Fig. 5 shows the stages of visual processing used to extract representations of object shapes. An important aspect of the shape representation is that it is invariant to transformations in position, scale and in-plane rotation.

Figure-ground segmentation is simplified by assuming a uniform background. A Gaussian model of the illumination-normalized background color is estimated and used to classify background/foreground pixels. Large connected regions near the center of the image indicate the presence of an object.

Based on methods developed by Schiele and Crowley (1996), objects are represented by histograms of local features derived from multiple two-dimensional views of an object. Shape is represented by locating all boundary pixels of an object in an image.

For each pair of boundary points, the normalized distance between points and the relative angle of the object edge at each point are computed. Each two-dimensional (relative angle, relative distance) data point is accumulated in a two-dimensional histogram to represent an image. (???)
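A sketch of how such a pairwise shape histogram might be built (my reconstruction of the description above, after Schiele & Crowley; the bin counts and the exact angle/distance normalisation are my assumptions).

```python
import numpy as np

def shape_histogram(boundary_points, edge_angles, n_angle_bins=8, n_dist_bins=8):
    """boundary_points: (N, 2) object boundary pixel coordinates.
    edge_angles: (N,) local edge orientations in radians.
    For every pair of boundary points, accumulate (relative edge angle,
    distance normalised by the maximum pairwise distance) into a 2-D
    histogram: invariant to translation, scale and in-plane rotation."""
    pts = np.asarray(boundary_points, dtype=float)
    ang = np.asarray(edge_angles, dtype=float)
    dists = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1))
    iu = np.triu_indices(len(pts), k=1)              # each unordered pair once
    rel_dist = dists[iu] / dists.max()
    rel_angle = np.abs(ang[iu[0]] - ang[iu[1]]) % np.pi
    hist, _, _ = np.histogram2d(rel_angle, rel_dist,
                                bins=[n_angle_bins, n_dist_bins],
                                range=[[0, np.pi], [0, 1]])
    return hist / hist.sum()                         # normalised 2-D shape histogram

# Hypothetical usage: boundary points sampled from a noisy circle.
theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
pts = np.c_[np.cos(theta), np.sin(theta)] + np.random.default_rng(0).normal(0, 0.01, (200, 2))
print(shape_histogram(pts, (theta + np.pi / 2) % np.pi).shape)   # (8, 8)
```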

Each three-dimensional object is represented by 15 shape histograms. A set of two-dimensional histograms representing different view points of an object is referred to as a view-set

In contrast to the speech input which is taken directly from the infant-caregiver recordings, the visual data are generated off-line since processing the original video from the infant-caregiver interactions would be a much more difficult computer vision problem. Even with the simplified visual input, shape representations are nonetheless susceptible to various forms of error including shadow effects and foreground/background segmentation errors. In addition, some shape classes such as dogs and horses are highly confusable, and other classes such as trucks (which include pickup trucks and fire trucks) are quite varied. Working with sensor-derived data is motivated by our goal of developing computational learning systems which do not rely on manual annotation of any input data.

4.3. Word learning

Word learning is achieved by processes which operate on two levels of memory, short term memory (STM) and long term memory (LTM). Input representations of spoken utterances paired with visual objects are temporarily stored in the STM. A {spoken utterance, object} pair is referred to as an audio-visual event or AV-event. Each entry of the STM contains a phoneme probability array representing a multiword spoken utterance, and a set of histograms which represent an object.

For each legal speech segment (legal segments contain at least one vowel) in the newly received AV-event, the filter searches for matching legal segments in the remainder of the STM whose entries also have matching visual contexts.

Because AV-prototypes are generated based on local recurrence within the STM, they are prone to errors. For example, the spoken word "the" might occur repeatedly in the STM in the context of a ball.

The second type of data structure found in LTM is the lexical item. Lexical items are created by consolidating AV-prototypes based on a mutual information criterion. (???)

A lexical item models a speech-shape association. A speech prototype specifies the ideal or canonical form of the speech sound. The acoustic radius specifies the allowable acoustic deviation from this ideal form. Similarly, the shape-prototype specifies the ideal shape to serve as a referent for the lexical item and the visual-radius specifies allowable deviation from this shape. The radii are a necessary component of the representation to account for natural variations inherent in acoustic and visual input.
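A quick data-structure sketch of a lexical item as described here; the field types and the matches() helper are my own guesses, with the two distance functions passed in as placeholders.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class LexicalItem:
    speech_prototype: np.ndarray   # canonical speech segment (phoneme probability array)
    acoustic_radius: float         # allowed deviation under the speech distance metric
    shape_prototype: np.ndarray    # canonical shape (view-set of 2-D histograms)
    visual_radius: float           # allowed deviation under the shape distance metric

    def matches(self, speech_seg, shape, d_speech, d_shape) -> bool:
        """An observation is covered by this item if it falls within both radii."""
        return (d_speech(speech_seg, self.speech_prototype) <= self.acoustic_radius
                and d_shape(shape, self.shape_prototype) <= self.visual_radius)
```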

Fig. 8. Mutual information plotted as a function of acoustic and visual radii for two speech segments which were both paired with view-sets of a toy dog. In the example on the left, the word "yeah" was linked to the dog since the word recurred in the STM in the presence of a dog. However this hypothesis found little support from other AV-prototypes in LTM which is indicated by the low flat mutual information surface. In contrast, in the example on the right, the word "dog" was correctly paired with a dog shape. In this case there was support for this hypothesis as indicated by the strongly peaked structure of the mutual information surface. CELL detects peaks such as this one using a fixed threshold. The point at which the surface peaks is used to determine the optimal settings of the radii. These radii along with the AV-prototype lead to a new lexical item. This is how CELL learns words from sights and sounds.

The mutual information quantifies the amount of information gained about the presence (or absence) of one category given that we know whether the associated category is present or not.

Since we have no a priori knowledge of how to set these radii, a search is performed to find the settings of both radii which maximize the mutual information
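A sketch of that search as I picture it: a grid search over candidate radii, where each setting turns distances from the LTM's AV-prototypes into binary "within radius" indicators, and the pair of radii giving the highest mutual information between the acoustic and visual indicators wins. The grid search and function names are my own assumptions; the paper just describes maximising MI over the two radii.

```python
import numpy as np

def binary_mi(x, y, eps=1e-12):
    """Mutual information (bits) between two binary indicator arrays."""
    x, y = np.asarray(x, bool), np.asarray(y, bool)
    mi = 0.0
    for xv in (True, False):
        for yv in (True, False):
            pxy = np.mean((x == xv) & (y == yv))
            if pxy > 0:
                mi += pxy * np.log2(pxy / (np.mean(x == xv) * np.mean(y == yv) + eps))
    return mi

def best_radii(speech_dists, shape_dists, radii_a, radii_v):
    """speech_dists[i], shape_dists[i]: distances from a candidate prototype to the
    i-th AV-prototype in LTM. Returns (peak MI, acoustic radius, visual radius)."""
    speech_dists, shape_dists = np.asarray(speech_dists), np.asarray(shape_dists)
    best = (0.0, None, None)
    for ra in radii_a:
        a_match = speech_dists <= ra            # acoustically within radius?
        for rv in radii_v:
            v_match = shape_dists <= rv         # visually within radius?
            mi = binary_mi(a_match, v_match)
            if mi > best[0]:
                best = (mi, ra, rv)
    return best
```

A flat, low MI surface over this grid corresponds to the "yeah"/dog case in Fig. 8; a sharp peak corresponds to the "dog"/dog case.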

5. Experimental results

This section describes an evaluation of CELL using infant-directed speech and visual images. We gathered a corpus of infant-directed speech from six caregivers and their preverbal infants. Participants were asked to engage in play centered around toy objects. Caregiver speech recordings and sets of camera images of toys were used as input to CELL.

Infants ranged in age from eight to eleven months. Each participant confirmed that their child could not yet produce single words. However, they reported varying levels of limited comprehension of words (e.g., their name, no, dog, milk, wave, etc.).

5.2. Objects

Caregivers were asked to interact naturally with their infants while playing with a set of age-appropriate objects. We chose seven classes of objects commonly named by young children (Huttenlocher & Smiley, 1994): balls, toy dogs, shoes, keys, toy horses, toy cars, and toy trucks. A total of 42 objects, six objects from each class, were used (Fig. 9).

5.3. Protocol

5.4. Speech data

A total of 36 sessions of speech recordings were obtained (6 participants, 6 sessions per participant). Utterances were automatically segmented based on the silence/nonsilence probability estimate generated by the recurrent neural network. On average, each speaker produced a total of about 1,300 utterances. Each utterance contained approximately five words. About one in 13 words referred directly to an object and could thus be grounded in terms of visual shapes. This provides a rough indication of the level of filtering which had to be performed by the system to extract meaningful speech segments from the recordings since the large majority (92%) of words were not directly groundable in terms of visual input. Some sample utterances from one of the participants are shown in Table 1. As would be expected, many utterances contain words referring to the object in play. Note, however, that many utterances do not contain direct references to the object in view, and occasionally even contain references to other objects used in the experiment but not in play at the time that the utterance was spoken.

Repetition of words occurred throughout all data sets, despite the fact that participants were not specifically instructed to teach their infants, or to talk exclusively about the objects. They were simply asked to play naturally. A temporal "clumping" effect for salient words was evident. For example, the word ball would appear several times within the span of half a minute because of the focused and repetitive nature of the interaction. Although we did not carefully examine the interaction between focus of attention and recurrence, it seemed that salient words were repeated even more when caregivers and infants were engaged in joint attention with respect to an object. The STM in CELL may be thought of as a buffer which is large enough to capture such temporal clumps of repeated patterns.

5.5. Visual data

5.6. Combining speech and visual data to create AV-events

Caregivers were asked to play with one object at a time. All utterances produced between the time the object was removed from the in-box and the time it was placed in the out-box were paired with that object. Spoken utterances were paired with randomly selected view-sets of the corresponding object to generate a set of AV-events.

In reality, however, infants were not always watching the object

5.7. Processing the audio-visual corpus

5.8. Baseline acoustic-only model

The Acoustic-only Model acquires a lexicon by identifying speech segments which recur most often in the input. The model assumes that some underlying language source concatenates words according to a set of unknown rules. Segments of speech which are found to repeat often are assumed to correspond to words of the target language. This is similar to previously suggested models of speech segmentation which are based on detection of recurrent sound patterns

The mutual information filter was replaced with a second recurrence filter which searched for speech segments which occurred often in LTM. In effect, this second recurrence filter identifies speech segments which occurred often across long time spans of input.(???)
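A sketch of what this baseline filter might look like (my reading; segments_match stands in for a comparison based on the speech distance metric, and min_count is an assumed threshold):

```python
def acoustic_only_filter(ltm_segments, segments_match, min_count=5):
    """Keep speech segments that recur often across LTM, ignoring the visual channel."""
    kept = []
    for i, seg in enumerate(ltm_segments):
        recurrences = sum(segments_match(seg, other)
                          for j, other in enumerate(ltm_segments) if j != i)
        if recurrences >= min_count:
            kept.append(seg)
    return kept
```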

5.9. Evaluation measures

5.9.1. Measure 1: segmentation accuracy

Do the start and end of each speech prototype correspond to word boundaries in English?

5.9.2. Measure 2: word discovery

Does the speech segment correspond to a single English word? We accepted words with attached articles and inflections, and we also allowed initial and final consonant errors. For example, the words /dɔg/ (dog), /ɔg/ (dog, with the initial /d/ missing), and /ðə dɔg/ (the dog) would all be accepted as positive instances of this measure. However /dɔgɪz/ (dog is) would be counted as an error. Initial and final consonant errors were allowed in this measure since we were interested in measuring how often single words were discovered regardless of exact precision of segmentation.

5.9.3. Measure 3: semantic accuracy

If the lexical item passes the second measure, does the visual prototype associated with it correspond to the word's meaning? If a lexical item fails on Measure 2, then it automatically fails on Measure 3.

5.10. Results

Contents of LTM using CELL to process one participant's data

Rank | Phonetic transcript | Text transcript | Shape category | Segment. accuracy | Word disc. | Semantic accuracy

1 | u | shoe | shoe E | 1 | 1 | 1

2 | fair . | fire* | truck D | 0 | 1 | 1

3 | r . k | *truck | truck C | 0 | 1 | 1

4 | d . g | dog | dog D | 1 | 1 | 1

5 | I ... | in the* | shoe A | 0 | 0 | 0

6 | ki | key | key C | 1 | 1 | 1

7 | ki | key | key E | 1 | 1 | 1

8 | d . ggi | doggie | dog C | 1 | 1 | 1

9 | b . l | ball | ball C | 1 | 1 | 1

10 | b . l | ball | ball A | 1 | 1 | 1

11 | ki . | key* | key C | 0 | 1 | 1

12 | . u | a shoe | shoe B | 0 | 1 | 1

13 | . n . z | *and this is | shoe B | 0 | 0 | 0

14 | (ono.) | (engine) | truck A | - | - | -

15 | (ono.) | (barking) | dog A | - | - | -

Total | | | | 54% | 85% | 85%

It is interesting to note that CELL did link objects with their appropriate onomatopoeic sounds. They were considered meaningful and groundable by CELL in terms of object shapes. This finding is consistent with infant learning; young children are commonly observed using onomatopoeic sounds to refer to common objects. The only reason these items were not processed further is the above-stated difficulty of assessing segmentation accuracy.

The confusion between laces and shoes is a classic example of the part-whole distinction (Quine, 1960) which CELL is susceptible to since only whole objects are nameable

CELL out-performed the Acoustic-only Model on all three measures

The low accuracy levels achieved in Measure 1 reflect the inherent difficulty of perfectly segmenting raw acoustic signals

For Measure 2, word discovery, almost three out of four lexical items (72%) produced by CELL were single words (with optional articles and inflections) (Fig. 13). In contrast, using the Acoustic-only Model, performance dropped to 31%. These results demonstrate the benefit of incorporating cross-channel information into the word learning process

On Measure 3, semantic accuracy, we see the largest difference in performance between CELL and the Acoustic-only Model (Fig. 14). With an accuracy of 57%, CELL out-performs the Acoustic-only Model by over a factor of four

The Acoustic-only Model performed well considering the input it received consisted of unsegmented speech alone. It also learned many words which were not acquired by CELL, including "go", "yes", "no", and "baby". These are plausible words to enter a young infant's vocabulary. This finding suggests that in addition to cross-channel structure, the learner may also use within-channel structure to hypothesize words for which the meaning is unknown

6. Discussion

In terms of early word learning, two theoretical vantage points are often posited. First, perhaps infants learn early concepts and then look for spoken labels to fit the concepts. On the other hand, they might first learn salient speech sequences and then look for their referents. Our model and experiments suggest that a more closely knit process in which these two stages in fact occur together is advantageous for the learner. Attending to the co-occurrence patterns within the visual context may help infants segment speech. Spoken utterances simultaneously act as labels for visual context, enabling the learner to form visual categories which ultimately serve as referents for words. By taking this approach, the learner is able to leverage information captured in the structure between streams of input.

The relatively poor performance [in the segmentation task] might have been improved by adding analysis components which take advantage of prosodic and phonotactic cues

Recent findings suggest that infants' phonetic discrimination performance drops during word learning (Stager & Werker, 1997). Assuming that increased phonetic discrimination aids in speech segmentation, this finding seems to indicate that word learning may in fact interfere with segmentation. (???)

CELL relies on structure across channels of input, but is not tied to the specific channels of speech and shape discussed in this paper. Thus findings that, for example, blind children are able to acquire sight-related words (Landau & Gleitman, 1985) do not pose a problem for the model. The underlying processes of CELL should work equally well with other channels and modalities of input, although this remains to be tested.

6.1 Assumptions in the model

We assumed that the STM is able to hold approximately 10 s of speech represented in terms of phoneme probabilities. This may be a somewhat optimistic estimate of preverbal working memory capacity. The STM duration may be reduced without significantly reducing learning rates by filtering out some portions of input speech so that only salient portions of the signal enter the STM. Such an approach would require designing filters which are able to reliably select portions of utterances which are more likely to contain semantically salient words. One approach may be to use filters based on prosodic and/or nonspeech cues.

To simplify the visual processing, however, we assumed that each spoken utterance is paired with a single object. … Visual selection is an extremely difficult problem which needs to address many complex issues including the analysis of caregiver intent. The supporting visual system would also grow in complexity since it would need to parse complex visual scenes.

During data collection, it was observed that infants were not always visually attending to the target object when an utterance was spoken by the caregiver. During data processing, however, we assumed that the infant was attending to the object during each utterance. This generalization was made to simplify data preparation and to minimize experimenter preprocessing of the data. Performance may have been improved if we had discarded utterances that occurred when the infant was not attending to the target object. In these instances, spoken utterances were less likely to contain direct references to the object and thus acted as noise for the learning system.

A third simplification to the visual input was that only object shape information was available to the model. This forced the model to learn words which could be grounded in shape, and avoided potential problems of trying to simultaneously learn groundings in alternate context channels (or combinations of context channels). This simplification is somewhat justified since young infants are known to have a "shape bias" in that they prefer to assign names to shapes rather than colors and other visual classes (Landau et al., 1988). In previous experiments (Roy, 1999), CELL has simultaneously learned words referring to both shape and color categories. No significant interference problems were found across contextual channels.

A fourth assumption inherent in the visual representation is that the correspondence between different views of the same physical object is given. In other words, when a view-set is generated from an object, CELL assumes that all views in the set belong to the same object. This seems to us to be a reasonable assumption since in real situations, the learner may smoothly shift his or her perspective to gather a set of views of a single object.

Note that the correspondence between two separate view-sets of the same object is not given. Since two view-sets of a shared underlying object will never be identical (due to variations in camera pose, lighting, etc.), a correspondence problem exists for the system at this level. The correspondence problem at the object class level, that is, establishing the correspondence between a view-set representing Car A with that of Car C, is yet more difficult and also addressed by CELL. (???)

CELL assumes that built-in mechanisms for representing speech in terms of phonemes and for extracting statistical representations of shapes are available prior to word learning

At least coarse phonemic representations may reasonably be assumed since infants as young as 6 months are able to make language-dependent phonemic discriminations (Kuhl, Williams, Lacerda, Stevens & Lindblom, 1992). The CELL model does not attempt to account for how initial phoneme discrimination abilities arise prior to word learning.

Experimental evidence also supports the assumption that infants have crude object representations and comparison mechanisms (Milewski and Bushnell). The shape representations employed in CELL are based on these findings.

6.2. Sensor grounded input

In contrast, CELL receives noisy input both in terms of raw acoustic speech and unannotated camera images. The problems of word learning would have been greatly simplified if CELL had access to … consistent representations.

We believe that using raw sensory input bears closer resemblance to the natural conditions under which infants learn. Infants only have access to their world through their perceptual systems. There is no teacher or trainer who provides consistent and noiseless data for the infant. Similarly, there should be no equivalent teacher or trainer to help the computational model.

Transcriptionists leverage their knowledge of language to overcome ambiguities in the acoustic signal and thus have the potential to influence the model. Pre-existing knowledge is bound to trickle into any model which relies on human-prepared input. In addition, raw speech contains prosodic information which may also provide cues for segmentation and determining points of emphasis. Such information is also lost when the speech signal is reduced to only a phonetic transcript.

Given that CELL operates on sensor data, we expected performance to be somewhat degraded in comparison to computational models which process symbolic input. By using robust signal representations and statistical modeling techniques, however, we were nonetheless able to obtain promising performance.

7. Conclusions

The CELL model represents an important step towards applying methods from signal processing and pattern recognition for the purpose of modeling language acquisition. By using these techniques, models may be implemented which process sensory data without the aid of human annotations or transcriptions. Complex models which involve the interaction of several processes and which are intimately tied to the nature of input may be tested in such computational frameworks. We plan to expand the types of contextual representations available to the model enabling acquisition of richer classes of words and relations between words.

 

Questions

what's the second class of segmentation algorithms???

how's the shape histogram created???

what's CCD??? is it live video??? presumably not, because there's no motor output to guide it…???

they're effectively using linguistic/verbal/auditory input as a normal sensory modality to correlate with visual input – is this a problem??? would they be better off with some other modality???

unfortunately/perhaps it is the best choice for a distal, detailed yet vague, interesting, large-corpus sensory modality…